2 Steps to Data Analysis:
What is the research question?
What are the properties of the variables of primary interest?
2 Types of variables:
Problems that summarise one categorical variable and those that describe the association between two categorical variables are extremely similar in scope, so we’ll cover both here.
The main tool is a table of frequencies (both one way for a single variable and two way for two variables)
One way table:
| Party | Liberal | Labor |
|---|---|---|
| Frequency | 300 | 295 |
Two way table:
| | Survived | Died |
|---|---|---|
| Male | 142 | 709 |
| Female | 308 | 154 |
2 types:
Bar chart of frequencies \(\rightarrow\) 1 var
Clustered bar chart (of frequencies) \(\rightarrow\) 2 vars
DON’T USE PIE CHARTS.
3 things to look at
Sample mean: \[ \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i \]
Sample variance: \[ s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 \]
Sample standard deviation: \[ s = \sqrt{s^2} \]
Sample median \[ \tilde{x}_{0.5} = \left\{ \begin{array}{l} x_{(\frac{n+1}{2})} \text{ if n is odd} \\ \frac{1}{2}(x_{(\frac{n}{2})} +x_{(\frac{n+2}{2})}) \text{ if n is even} \end{array} \right. \]
pth sample quantile:
\[ \tilde{x}_p = x_{(k)} \quad \text{where} \quad p = \frac{k-0.5}{n} \quad \text{for} \quad k \in \{1,2,3,\ldots,n\} \]
Inter-quartile Range: \[ IQR = \tilde{x}_{0.75} - \tilde{x}_{0.25} \]
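These summaries are straightforward to compute directly from their formulas. A minimal Python sketch of the mean, variance, standard deviation and median definitions above (the data values are made up for illustration; the notes themselves use R):

```python
import math

# made-up sample data
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(x)

xbar = sum(x) / n                                   # sample mean
s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)    # sample variance (n - 1 divisor)
s = math.sqrt(s2)                                   # sample standard deviation

xs = sorted(x)                                      # order statistics x_(1), ..., x_(n)
if n % 2 == 1:
    median = xs[(n + 1) // 2 - 1]                   # middle order statistic
else:
    median = 0.5 * (xs[n // 2 - 1] + xs[n // 2])    # average of the two middle values

print(xbar, s2, s, median)
```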
Quantile-based measures (median, IQR) are much less sensitive to outliers than moment-based measures (mean, variance, standard deviation).
Kernel density estimator: \[ \hat{f}_h(x) = \frac{1}{n} \sum_{i=1}^{n} w_h(x-x_i) \] where \(h\) is the bandwidth parameter.
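The notes don’t pin down \(w_h\); a common choice is a scaled Gaussian kernel, \(w_h(u) = K(u/h)/h\). A minimal Python sketch under that assumption:

```python
import math

def gaussian_kernel(u):
    # standard normal density, a common kernel choice
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, data, h):
    # f_hat_h(x) = (1/n) * sum_i w_h(x - x_i), with w_h(u) = K(u/h) / h
    n = len(data)
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (n * h)

data = [1.0, 2.0, 2.5, 3.0, 5.0]   # made-up sample
print(kde(2.0, data, h=1.0))       # estimated density at x = 2
```

Larger \(h\) gives a smoother (flatter) estimate; smaller \(h\) tracks the individual observations more closely.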
Sample distributions come in three different skews: left-skewed, symmetric, and right-skewed.
It’s also worth checking for outliers, which can influence the apparent shape of the data.
correlation coefficient (2 quant vars):
\[ r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right) \] where \(\bar{x}\) and \(s_x\) are the sample mean and standard deviation of \(x\), and similarly for \(y\).
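A direct Python transcription of this formula, with made-up data; since \(y\) here is exactly linear in \(x\) with positive gradient, \(r\) comes out as 1:

```python
import math

def corr(x, y):
    # r = (1 / (n - 1)) * sum of standardised products
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
    sy = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))
    return sum(((xi - xbar) / sx) * ((yi - ybar) / sy)
               for xi, yi in zip(x, y)) / (n - 1)

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]   # exactly y = 2x: a perfect positive linear relationship
print(corr(x, y))
```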
3 Types of result:
The second and third results are linear relationships between the two variables (negative and positive gradient respectively).
2 measures:
Just use a comparative box plot.
Linear transformations take the form \[ y_i = a + bx_i \] for each \(i\), with \(b \neq 0\).
It doesn’t affect the shape of the distribution \(\rightarrow\) only the location and spread.
A common Linear transformation is the \(z\)-score or standardised score: \[ z = \frac{x-\bar{x}}{s_x} \]
It measures how many standard deviations the value is above/below the mean; the larger \(|z|\), the more unusual the value.
The most common non-linear transformation is the log-transformation. It can reveal interesting relationships and structure among values that may seem too close together.
Important Note: Let \(y_i = h(x_i)\) be some non-linear transformation of real values \(x_i\). In most cases: \[ \bar{y} \neq h(\bar{x}) \]
ie: the mean of the transform won’t be equal to the mean of the original data
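A tiny Python check of this point, with made-up values: the mean of the logs is not the log of the mean.

```python
import math

x = [1.0, 10.0, 100.0]                 # made up, spread over orders of magnitude
xbar = sum(x) / len(x)                 # 37.0

logx = [math.log10(v) for v in x]      # log-transformed data: [0.0, 1.0, 2.0]
mean_log = sum(logx) / len(logx)       # mean of the transform = 1.0

print(mean_log, math.log10(xbar))      # 1.0 vs log10(37), which is larger
```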
A random variable is a variable whose value is uncertain.
Random variables take their values within a sample space of possible outcomes.
Eg: for three coin tosses, \(S = \{HHH,HHT,HTH,THH,HTT,THT,TTH,TTT\}\)
Eg: Let \(X\) be the number of heads. Then, \[ Pr(X = 0) = \frac{1}{8}, \quad Pr(X = 1) = \frac{3}{8}, \quad Pr(X = 2) = \frac{3}{8}, \quad Pr(X = 3) = \frac{1}{8}. \]
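These probabilities can be checked by brute-force enumeration of the 8 equally likely outcomes; a quick Python sketch:

```python
from itertools import product

# all 2^3 equally likely outcomes of three fair coin tosses
outcomes = list(product("HT", repeat=3))

# Pr(X = k), where X counts the heads
pmf = {k: sum(o.count("H") == k for o in outcomes) / len(outcomes)
       for k in range(4)}
print(pmf)   # {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}
```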
A random variable is discrete if it takes countably many values, i.e. there is a countable set of values \(x\) for which \(Pr(X=x) > 0\).
The probability function of the discrete random variable \(X\) is \[ f_X(x) = Pr(X=x) \]
It has the following properties: \[ f_X(x) \geq 0 \text{ for all } x \in \mathbb{R} \] and \[ \sum_{\text{all } x} f_X(x) = 1 \]
eg: the probabilities from the heads and tails example add to 1: \[ \sum Pr(X = x) = Pr(X = 0) + Pr(X = 1) + Pr(X = 2) + Pr(X = 3) = \frac{1}{8} + \frac{3}{8} + \frac{3}{8} + \frac{1}{8} = 1 \]
The analogue of the probability function for continuous random variables is the density function:
\[ \int_A f_X(x)dx = Pr(X \in A) \]
It has 2 similar properties:
\[ f_X(x) \geq 0 \text{ for all } x \in \mathbb{R} \]
and
\[ \int_{-\infty}^{\infty} f_X(x)dx = 1 \]
Therefore, for any continuous random variable \(X\) and a pair of numbers \(a \leq b\) we have
\[ Pr(a \leq X \leq b) = \int_{a}^{b} f_X(x)dx = \text{area under } f_X \text{ between } a \text{ and } b. \]
Hence, if you can derive \(f_X(x)\) you can derive any probability about \(X\) and hence any property of \(X\).
Continuous random variables \(X\) have the property \[ Pr(X = a) = 0 \text{ for any } a \in \mathbb{R} \]
Hence for continuous random variables \(\leq, \geq\) can be used interchangeably with \(<, >\).
The Cumulative Distribution Function (cdf) of the random variables \(X\) is \[ F_X(x) = Pr(X \leq x). \]
Eg coin toss example: \[ F_X(1) = \frac{1}{8} + \frac{3}{8} = \frac{1}{2} \] \[ F_X(2) = \frac{1}{8} + \frac{3}{8} + \frac{3}{8} = \frac{7}{8} \]
Hence: \[ Pr(a < X \leq b) = F_X(b) - F_X(a) \]
Calculating \(F_X(x)\) from \(f_X(x)\) can be done as:
\[ F_X(x) = \left\{ \begin{array}{l} \sum_{t \leq x} f_X(t) \quad \text{ if X is discrete} \\ \int_{-\infty}^{x}f_X(t)dt \quad \text{if X is continuous} \end{array} \right. \]
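A small Python sketch of the discrete case, reusing the coin-toss pmf (it reproduces \(F_X(1) = \frac{1}{2}\) and \(F_X(2) = \frac{7}{8}\) from above):

```python
pmf = {0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8}   # heads in three fair coin tosses

def cdf(x, pmf):
    # F_X(x) = sum of f_X(t) over all t <= x (discrete case)
    return sum(p for t, p in pmf.items() if t <= x)

print(cdf(1, pmf), cdf(2, pmf))   # 0.5 0.875
```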
and vice-versa:
\[ f_X(x) = \left\{ \begin{array}{l} F_X(x) - F_X(x^-) \quad \text{ if X is discrete} \\ F'_X(x) \quad \quad \quad \quad \quad \text{ if X is continuous} \end{array} \right. \] where \(F_X(x^-)\) is the limiting value of \(F_X(x)\) as we approach \(x\) from below.
A quantile is just a percentile expressed as a proportion rather than a percentage.
If \(F_X\) is strictly increasing in some interval then \(F^{-1}_X\) is well defined and, for a specified \(p \in (0,1)\), the pth quantile of \(F_X\) is \(x_p\) where: \[ F_X(x_p) = p \text{ or } x_p = F^{-1}_X(p) \]
Expected value / mean of a discrete random variable: \[ E(X) = \sum_{\text{all } x} x \times Pr(X = x) = \sum_{\text{all } x } xf_X(x) \] Expected value / mean of a continuous random variable: \[ E(X) = \int_{-\infty}^{\infty} xf_X(x)dx \]
\[ E\{g(X)\} = \left\{ \begin{array}{l} \sum_{\text{all } x} g(x) f_X(x) \quad \; \; \text{if X is discrete} \\ \int_{-\infty}^{\infty} g(x) f_X(x)dx \quad \text{ if X is continuous} \end{array} \right. \]
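A Python sketch of the discrete case, again using the coin-toss pmf; taking \(g(x) = x\) gives the mean, and \(g(x) = x^2\) gives the second moment needed for the variance:

```python
pmf = {0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8}   # heads in three fair coin tosses

def expect(g, pmf):
    # E{g(X)} = sum over all x of g(x) * f_X(x), discrete case
    return sum(g(x) * p for x, p in pmf.items())

mu = expect(lambda x: x, pmf)                  # E(X) = 1.5
var = expect(lambda x: x * x, pmf) - mu ** 2   # Var(X) = E(X^2) - E(X)^2 = 0.75
print(mu, var)
```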
rth moment of \(X\) about \(a\) defined as \(E\{(X-a)^r\}\)
\[ E(a+bX) = a+bE(X) \] for both continuous and discrete
\[ Var(X) = E\{(X-\mu)^2\} = E(X^2) - E(X)^2 \]
\[ sd = \sqrt{Var(X)} \]
Chebyshev’s Inequality is a fundamental result concerning tail probabilities of general random variables. It is useful for deriving the convergence results given later.
\[ Pr(|X - \mu| > k\sigma) \leq \frac{1}{k^2} \]
where \(k > 0\) is a constant
It’s often stated as: “the probability that \(X\) is more than \(k\) standard deviations from its mean is at most \(1/k^2\).”
Chebyshev’s Inequality makes no assumptions about the distribution of \(X\).
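A quick simulation check (illustrative, with simulated N(0,1) data): for the normal distribution the true probability of landing more than \(k = 2\) standard deviations out is about 0.05, comfortably below the Chebyshev bound of \(1/k^2 = 0.25\).

```python
import random

random.seed(1)
n = 50_000
mu, sigma, k = 0.0, 1.0, 2.0

xs = [random.gauss(mu, sigma) for _ in range(n)]
frac = sum(abs(x - mu) > k * sigma for x in xs) / n

print(frac, 1 / k ** 2)   # empirical tail probability vs the Chebyshev bound
```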
In some cases you can derive the distribution from first principles
For continuous random variables, this means attempting to derive an expression for cumulative probabilities \(F_X(x)\), then \(f_X(x) = F'_X(x)\)
For discrete X we have: \[ f_Y(y) = Pr(Y = y) = Pr(h(X) = y) = \sum_{x:h(x)=y} Pr(X = x) \]
For a continuous random variable \(X\), if \(h\) is monotonic over the set \(\{x : f_X(x) > 0\}\), then
\[ f_Y(y) = f_X(x) \left|\frac{dx}{dy}\right| = f_X\{h^{-1}(y)\}\left|\frac{dx}{dy}\right| \] for \(y\) such that \(f_X\{h^{-1}(y)\} > 0\)
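As a worked example of this change-of-variables formula (a standard one, not from the notes): let \(X \sim Uniform(0,1)\) and \(Y = h(X) = -\ln X\). Then \(x = h^{-1}(y) = e^{-y}\) and \(\left|\frac{dx}{dy}\right| = e^{-y}\), so \[ f_Y(y) = f_X(e^{-y})\,e^{-y} = 1 \times e^{-y} = e^{-y}, \quad y > 0, \] which is the Exponential(1) density.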
| Distribution | Type | Parameters | \(f_X(x)\) | Domain | \(E(X)\) | \(Var(X)\) | Uses |
|---|---|---|---|---|---|---|---|
| Bernoulli | Discrete | \(p\) | \(p^{x}(1-p)^{1-x}\) | \(\{0,1\}\) | \(p\) | \(p(1-p)\) | A single trial with two possible outcomes (Bernoulli trial) |
| Binomial - \(Bin(n,p)\) | Discrete | \(n,p\) | \(\binom{n}{x}p^{x}(1-p)^{n-x}\) | \(\{0,1,2,\dots,n\}\) | \(np\) | \(np(1-p)\) | Number of successes from n independent Bernoulli trials. |
| Geometric | Discrete | \(p\) | \(p(1-p)^{x-1}\) | \(\{1,\dots\}\) | \(\frac{1}{p}\) | \(\frac{1-p}{p^2}\) | Number of independent Bernoulli trials until first success. |
| Hypergeometric | Discrete | \(n,m,N\) | \(\frac{\binom{m}{x}\binom{N-m}{n-x}}{\binom{N}{n}}\) | \(\{0,1,\dots,\min(m,n)\}\) | \(\frac{nm}{N}\) | \(\frac{nm}{N}(1-\frac{m}{N})\frac{N-n}{N-1}\) | Number of successes in a sample of size n drawn without replacement from N items, of which m are successes. |
| Poisson | Discrete | \(\lambda\) | \(\frac{e^{-\lambda}\lambda^{x}}{x!}\) | \(\{0,1,2,\dots\}\) | \(\lambda\) | \(\lambda\) | Counting independent events (that have a constant occurrence rate). |
| Exponential | Continuous | \(\beta\) | \(\frac{1}{\beta}e^{\frac{-x}{\beta}}\) | \(x > 0\) | \(\beta\) | \(\beta^2\) | Time between independent events (that have constant occurrence probability). |
| Uniform | Continuous | \(a,b\) | \(\frac{1}{b-a}\) | \(a < x < b\) | \(\frac{a+b}{2}\) | \(\frac{(b-a)^2}{12}\) | An event with constant probability within some interval (a,b). |
| Normal - \(N(\mu,\sigma^2)\) | Continuous | \(\mu,\sigma^2\) | \(\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}\) | \(-\infty < x < \infty\) | \(\mu\) | \(\sigma^2\) | Useful for some variables (e.g. height) but mostly for inference. |
| Gamma - \(Gamma(\alpha,\beta)\) | Continuous | \(\alpha,\beta\) | \(\frac{e^{-x/\beta}x^{\alpha-1}}{\Gamma(\alpha)\beta^\alpha }\) | \(x > 0\) | \(\alpha\beta\) | \(\alpha\beta^2\) | Generalisation of the exponential. |
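As a sanity check on one row of the table, this Python sketch builds the \(Bin(n,p)\) probability function from its formula and confirms \(E(X) = np\) and \(Var(X) = np(1-p)\) (the values \(n = 10\), \(p = 0.3\) are arbitrary):

```python
from math import comb

n, p = 10, 0.3
# f_X(x) = C(n, x) * p^x * (1 - p)^(n - x)
pmf = {x: comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n + 1)}

mean = sum(x * q for x, q in pmf.items())
var = sum(x * x * q for x, q in pmf.items()) - mean ** 2

print(mean, var)   # approximately np = 3.0 and np(1 - p) = 2.1
```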
Observations are often taken in pairs \(\rightarrow\) one observation of each of two variables.
Often we need to look at the relationship between 2 variables.
The joint probability function of discrete random variables \(X\) and \(Y\) gives the probability that \(X = x\) and \(Y = y\): \[ f_{X,Y}(x,y) = Pr(X=x,Y=y) \]
The joint density function of continuous random variables is a bivariate function with the property \[ \int \int_A f_{X,Y}(x,y)dxdy = Pr((X,Y) \in A) \]
for any subset \(A\) of \(\mathbb{R}^2\)
Figure out the area you want to integrate over
Insert the limits into the integrals of the joint density function
Integrate with respect to x and Integrate with respect to y (using the limits)
Solve
Alternatively, you could integrate with respect to \(x\) and \(y\) first and then substitute the limits into the resulting expression.
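The steps above can be sketched numerically; this Python example uses a midpoint rule over a grid, with the made-up joint density \(f(x,y) = x + y\) on the unit square (which integrates to 1):

```python
def pr_region(f, x_lo, x_hi, y_lo, y_hi, steps=400):
    # integrate the joint density f over a rectangle with the midpoint rule
    hx = (x_hi - x_lo) / steps
    hy = (y_hi - y_lo) / steps
    total = 0.0
    for i in range(steps):
        for j in range(steps):
            x = x_lo + (i + 0.5) * hx
            y = y_lo + (j + 0.5) * hy
            total += f(x, y) * hx * hy
    return total

f = lambda x, y: x + y                 # a valid joint density on [0,1] x [0,1]
print(pr_region(f, 0, 1, 0, 1))        # total probability, approximately 1
print(pr_region(f, 0, 0.5, 0, 0.5))    # Pr(X <= 0.5, Y <= 0.5), approximately 0.125
```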
The marginal probability functions of discrete random variables are: \[ f_X(x) = \sum_{\text{all } y} f_{X,Y}(x,y) \]
\[ f_Y(y) = \sum_{\text{all } x} f_{X,Y}(x,y) \]
Similarly for continuous random variables:
\[ f_X(x) = \int^\infty_{-\infty} f_{X,Y}(x,y) dy \]
\[ f_Y(y) = \int^\infty_{-\infty} f_{X,Y}(x,y) dx \]
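A small discrete sketch of marginalising in Python (the joint probabilities are made up but sum to 1):

```python
# made-up joint pmf f_{X,Y}(x, y) for two binary variables
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

fX, fY = {}, {}
for (x, y), p in joint.items():
    fX[x] = fX.get(x, 0.0) + p   # f_X(x): sum over all y
    fY[y] = fY.get(y, 0.0) + p   # f_Y(y): sum over all x

print(fX, fY)   # fX close to {0: 0.3, 1: 0.7}, fY close to {0: 0.4, 1: 0.6}
```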
Firstly Recall Bayes Rule: \[ P(A|B) = \frac{P(B|A)P(A)}{P(B)} = \frac{P(A \cap B)}{P(B)}, \text{ if } P(B) \neq 0 \]
The following are just applications of this rule to discrete and continuous probabilities.

### Discrete
\[ f_{X|Y}(x|y) = Pr(X = x | Y = y) = \frac{Pr(X = x , Y = y)}{Pr(Y=y)} = \frac{f_{X,Y}(x,y)}{f_Y(y)} \]
The oppisite is also true for \(f_{Y|X}(y|x)\).
\[ Pr(Y \in A |X = x) = \sum_{y \in A} f_{Y|X}(y|X = x) \]
\[ f_{X|Y}(x|Y=y) = \frac{f_{X,Y}(x,y)}{f_Y(y)} \] The same holds for \(f_{Y|X}(y| X = x)\).
\[ Pr(a \leq Y \leq b | X = x) = \int_a^b f_{Y|X}(y|x) dy \]
The conditional expected value of \(X\) given \(Y = y\) is \[ E(X|Y = y) = \left\{ \begin{array}{l} \sum_{\text{all } x} xPr(X=x | Y = y)\quad \text{if X is discrete} \\ \int_{-\infty}^{\infty} xf_{X|Y}(x|y)dx \quad \quad \quad \quad \;\text{if X is continuous} \end{array} \right. \]
Similarly,
\[ E(Y|X = x) = \left\{ \begin{array}{l} \sum_{\text{all } y} yPr(Y = y | X = x)\quad \text{if Y is discrete} \\ \int_{-\infty}^{\infty} yf_{Y|X}(y|x)dy \quad \quad \quad \quad \;\text{if Y is continuous} \end{array} \right. \]
\[ Var(X|Y=y) = E(X^2 | Y=y) - \{E(X| Y=y)\}^2 \]
Where: \[ E(X^2|Y = y) = \left\{ \begin{array}{l} \sum_{\text{all } x} x^2Pr(X=x | Y = y)\quad \text{if X is discrete} \\ \int_{-\infty}^{\infty} x^2f_{X|Y}(x|y)dx \quad \quad \quad \quad \;\text{if X is continuous} \end{array} \right. \]
This also applies to \(Var (Y|X =x) \)
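A Python sketch of a conditional pmf and conditional expectation, using a made-up joint table:

```python
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}   # made-up f_{X,Y}

# f_{X|Y}(x | 1) = f_{X,Y}(x, 1) / f_Y(1)
fY1 = sum(p for (x, y), p in joint.items() if y == 1)           # f_Y(1) = 0.6
cond = {x: p / fY1 for (x, y), p in joint.items() if y == 1}    # {0: 1/3, 1: 2/3}

# E(X | Y = 1) = sum over x of x * f_{X|Y}(x | 1)
e_x_given_y1 = sum(x * p for x, p in cond.items())
print(e_x_given_y1)   # approximately 2/3
```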
Variables \(X\) and \(Y\) are independent if and only if: \[ f_{Y|X}(y|x) = f_Y(y) \] or equivalently
\[ f_{X|Y}(x|y) = f_X(x) \]
And hence
\[ F_{X,Y}(x,y) = F_X(x) \times F_Y(y) \]
also hence
\[ E(XY) = E(X) \times E(Y) \\ \text{or more generally:} \\ E(g(X) \times h(Y)) = E\{g(X)\} \times E\{h(Y)\} \]
\[ Cov(X,Y) = E\{(X-\mu_X)(Y-\mu_Y)\} \\ \text{where } \mu_X = E(X) \text{ and } \mu_Y = E(Y) \]
The covariance measures how \(X\) and \(Y\) vary together linearly. If it’s \(> 0\) then \(X\) and \(Y\) are positively associated (ie: they move along their axes the same way \(\rightarrow\) if \(X\) is big then \(Y\) will tend to be big).
The inverse is true for \(< 0\) \(\rightarrow\) if \(X\) is big then \(Y\) will tend to be small.
Here are 2 more results from the covariance:
\[ Cov(X,X) = Var(X) \] this one is kinda self-explanatory
\[ Cov(X,Y) = E(XY) - \mu_X\mu_Y = E(XY) -E(X)E(Y) \]
If \(X\) and \(Y\) are independent then \(Cov(X,Y) = 0\) (but the converse does not hold in general).
Covariance also comes into play when finding bivariate variance transforms:
\[ Var(aX+bY) = a^2Var(X) + 2abCov(X,Y) + b^2Var(Y) \] Hence:
\[ Var(X+Y) = Var(X) + 2Cov(X,Y) + Var(Y) \]
If \(X\) and \(Y\) are independent: \[ Var(X+Y) = Var(X) + Var(Y) \]
\[ Var(X-Y) = Var(X) + Var(Y) \]
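These identities hold exactly for the sample versions of variance and covariance as well, which makes them easy to check in Python with simulated (deliberately correlated) data:

```python
import random

random.seed(0)
n = 50_000
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.5 * xi + random.gauss(0, 1) for xi in x]   # y is positively correlated with x

def var(v):
    m = sum(v) / len(v)
    return sum((vi - m) ** 2 for vi in v) / (len(v) - 1)

def cov(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

s = [xi + yi for xi, yi in zip(x, y)]
lhs = var(s)                             # Var(X + Y) computed directly
rhs = var(x) + 2 * cov(x, y) + var(y)    # Var(X) + 2Cov(X,Y) + Var(Y)
print(lhs, rhs)                          # agree up to rounding error
```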
\[ Corr(X,Y) = \frac{Cov(X,Y)}{\sqrt{Var(X) \times Var(Y)}} \]
The correlation measures the strength of the relationship between \(X\) and \(Y\).
If \(Corr(X,Y) = 0\) then the two variables are uncorrelated.
Correlation always lies between -1 and 1. As it moves away from zero the relationship becomes stronger. When the correlation reaches -1 or 1 the two variables are perfectly linearly related, meaning they can be expressed in the form \(Y = a+bX\).
where \[ \rho = Corr(X,Y) \]
Below is a 3D bivariate normal distribution.
Todo: FINISH CHAPTER
The way we collect data can affect how we conduct our analysis. Data is basically never going to be exactly how we want it from the jump, so we often need to change how we sample our data and conduct experiments to minimise data loss.
When collecting data we need to ensure that it’s representative and random, so that we can make accurate predictions about the larger population.
A sample is said to be representative if:
\[ f_{X_i}(x) = f_X(x) \text{ for each } i. \]
REPRESENTATIVENESS IS MORE IMPORTANT THAN SAMPLE SIZE. IT IS BETTER TO HAVE A SMALL BUT REPRESENTATIVE SAMPLE THAN A LARGE BUT UNREPRESENTATIVE SAMPLE.
A random sample of size \(n\) is a set of random variables that are independent and have the same probability distribution.
A simple random sample is a method that samples without replacement, in which every element of the sample space has an equally likely probability of being sampled.
In R this looks like

```r
# generate 30 values from the standard normal distribution
x <- rnorm(30)
# take a simple random sample of 10 of those values
sample(x, 10)
## [1] -0.2151128 -0.3356333 1.0256975 0.5306271 0.1870914 1.8907001
## [7] 0.5010078 0.6518803 1.3503825 -1.0070247
```
Suppose that \(X\) and \(Y\) are independent random variables (that are non-negative) and let \(Z = X + Y\).
Then for the discrete case: \[
f_Z(z) = \sum_{y=0}^{z} f_X(z-y)f_Y(y), \quad z=0,1,\dots
\]
For the continuous case: \[ f_Z(z) = \int_{\text{all possible } y} f_X(z-y)f_Y(y)dy \]
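A discrete Python sketch of the convolution formula using the standard two-dice example (not from the notes): the pmf of the total of two fair dice.

```python
die = {x: 1 / 6 for x in range(1, 7)}   # pmf of one fair die

# f_Z(z) = sum over y of f_X(z - y) * f_Y(y), computed by accumulating products
f_z = {}
for y, py in die.items():
    for x, px in die.items():
        f_z[x + y] = f_z.get(x + y, 0.0) + px * py

print(f_z[7])   # 6/36, the most likely total
```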
The moment generating function of a random variable \(X\) is:
\[ m_X(u) = E(e^{uX}) \]
In general: \[ E(X^r) = m^{(r)}_X (0) \text{ for } r = 0,1,2,\dots \] Where \(m^{(r)}_X\) is the \(r\)th derivative of \(m_X(u)\).
For independent \(X\) and \(Y\): \[ m_{X+Y}(u) = m_X(u)m_Y(u) \]
or more generally, for independent \(X_1,\dots,X_n\): \[ m_{\sum^{n}_{i = 1} X_i}(u) =\prod_{i=1}^{n}m_{X_{i}}(u) \]
and for the average \(\bar{X} = \frac{1}{n}\sum^{n}_{i=1} X_i\): \[ m_{\bar{X}}(u) =\prod_{i=1}^{n}m_{X_{i}}\left(\frac{u}{n}\right) \]
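As a worked example (a standard result, not derived in the notes): a Bernoulli(\(p\)) variable has \[ m_X(u) = E(e^{uX}) = (1-p) + pe^u \] so for \(n\) independent Bernoulli(\(p\)) trials \(X_1,\dots,X_n\), \[ m_{\sum^{n}_{i=1} X_i}(u) = \{(1-p) + pe^u\}^n, \] which is the MGF of the \(Bin(n,p)\) distribution \(\rightarrow\) the sum of \(n\) independent Bernoulli trials is Binomial.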
\[ bias( \hat{\theta}) = E(\hat{\theta}) - \theta \]
\[ se(\hat{\theta}) = \sqrt{Var(\hat{\theta})} \]
\[ MSE(\hat{\theta}) = E\{(\hat{\theta} - \theta)^2\} = bias(\hat{\theta})^2 + Var(\hat{\theta}) \]
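A quick Monte Carlo illustration of bias in Python (simulated data): the variance estimator that divides by \(n\) systematically underestimates \(\sigma^2\), while dividing by \(n-1\) is unbiased.

```python
import random

random.seed(42)
true_var = 4.0          # data simulated from N(0, sd = 2)
n, reps = 5, 50_000

biased_vals, unbiased_vals = [], []
for _ in range(reps):
    x = [random.gauss(0, 2) for _ in range(n)]
    m = sum(x) / n
    ss = sum((xi - m) ** 2 for xi in x)
    biased_vals.append(ss / n)          # divides by n: biased low
    unbiased_vals.append(ss / (n - 1))  # sample variance: unbiased

print(sum(biased_vals) / reps)     # averages near (n-1)/n * 4 = 3.2
print(sum(unbiased_vals) / reps)   # averages near 4
```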